2018-01-12 / FMA sub-sampling

  • Problem statement:

    • Input:

      • C CSV files
      • each file has n rows; each row in file c encodes the prediction for class c on a 1 sec segment
      • A target number k
      • Target fractions for class representations p[c].
    • Output:

      • A set of k clips, each 10 seconds in duration
      • Aggregate predicted likelihoods for each class c on each of the k clips
      • Each class c has total aggregate likelihood of at least p[c] * k over the selected clips
  • Method:
    1. drop edge effects from the beginning and end of tracks: remove the first and last frames from each track.
    2. window the frame observations into 10sec clips with aggregate labels
    3. threshold the aggregate likelihoods to binarize the representation
    4. subsample the 10sec clips using entrofy
  • Questions:
    • How should likelihoods be aggregated within a segment?
      • Mean? Max? Quartile?
      • Mean makes sense from the perspective of random frame sampling
      • Quartile makes sense wrt sparse events
      • Max makes sense wrt extremely sparse events
    • How should likelihoods be thresholded? 0.5? Empirical average over X?
      • $p[y] = \sum_x p[y \mid x]\, p[x] \approx \frac{1}{|X|} \sum_{x \in X} p[y \mid x]$
      • But that doesn't really matter: the threshold should be Bayes-optimal (i.e., 0.5)
    • What's the target number of positives per class, k * p[c]?
      • Maybe that should be determined by the base rate estimation p[y]?
  • Next step: Question scheduling on CF.
    • Idea: cluster the tracks according to their aggregated likelihood vectors (a rough sketch follows this entry)
      • Or maybe by their thresholded likelihoods?
    • Set the number of clusters to be relatively large (say, 23² = 529, roughly 512)
    • When generating questions for an annotator, assign them to a cluster and only generate questions from that cluster
    • Reasoning: this will keep the labels consistent from one question to the next
  • UPDATE:
    • Windowing and aggregation are happening upstream of this
    • Aggregation is max over the middle 8 frames (see the sketch below)
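
  A minimal sketch of that upstream windowing/aggregation, for reference only (hypothetical
  aggregate_clips helper; swapping .max for .mean or a quantile gives the other aggregators
  discussed above):

    import pandas as pd

    def aggregate_clips(frames, clip_len=10):
        # frames: one row per 1 sec frame (in temporal order), one column per
        # class, for a single track.  Track-level edge frames are assumed to
        # have been dropped already (method step 1).
        clips = []
        index = []
        for start in range(0, len(frames) - clip_len + 1, clip_len):
            window = frames.iloc[start:start + clip_len]
            # aggregate as the max over the middle clip_len - 2 frames (8 of 10)
            clips.append(window.iloc[1:-1].max(axis=0))
            index.append(frames.index[start])
        return pd.DataFrame(clips, index=index)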

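  The cluster-based question scheduling above is only an idea at this point; a rough sketch
  (assumes the clip-by-class likelihood table df loaded in the notebook below, with fragment
  ids of the form trackid_clipindex, e.g. '000002_0013'):

    import pandas as pd
    from sklearn.cluster import KMeans

    # One likelihood vector per track: average the clip-level likelihoods over
    # all clips belonging to the same track.
    track_ids = df.index.str.split('_').str[0]
    track_vectors = df.groupby(track_ids).mean()

    # Relatively large number of clusters, per the note above.
    km = KMeans(n_clusters=512, random_state=20180112)
    clusters = pd.Series(km.fit_predict(track_vectors), index=track_vectors.index)

    # A question batch for one annotator would then draw only from tracks in a
    # single cluster, e.g. clusters[clusters == 3].index
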
2018-01-19

  • Eric has provided the per-fragment aggregated estimates as one giant table
  • So what are our entrofy parameters?

    • attribute thresholds
      • Do we only split below/above 0.5?
      • Or break likelihood into quartiles?
      • Sounds like quartiles are the way to go
    • target proportions per class?
      • we can try to preserve the empirical distribution
      • or a biased distribution achieved by grouping on the track ids?
      • or uniform?
      • Uniform across quartiles for each instrument
    • output set size?
      • 20-50 positives per instrument?
      • say, 16 * 4 * n_classes
      • Maybe round up to 1K to start
  • If we only want one example per track, we can make an aux categorical column that's the track index, and set the target number to 1 (sketched below)
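
  One way to realize the one-example-per-track constraint (a sketch only; the track id is
  the prefix of the fragment index, e.g. '000002_0013' -> '000002'):

    # Auxiliary categorical column carrying the track id of each fragment.
    df_aux = df.copy()
    df_aux['track_id'] = df_aux.index.str.split('_').str[0]

    # Alternatively, enforce at most one selected clip per track after the fact,
    # keeping the first selected fragment per track (idx is the entrofy output).
    selected = df.loc[idx]
    one_per_track = selected.groupby(selected.index.str.split('_').str[0]).head(1)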

2018-02-02

  • Turns out we didn't get the data transferred in time on 01/19, so still waiting
  • output set size: 500-1000 positives per class
  • try both hard threshold and quartile sampling

In [1]:
import numpy as np
import pandas as pd
import entrofy

In [2]:
import matplotlib.pyplot as plt

In [3]:
%matplotlib nbagg

In [50]:
# Clip-level likelihoods: max over the middle 8 frames of each 10 sec clip (see notes above)
df = pd.read_csv('/home/bmcfee/data/vggish-likelihoods-a226b3-maxagg10.csv.gz', index_col=0)

In [51]:
df.head(5)


Out[51]:
accordion bagpipes banjo bass cello clarinet cymbals drums flute guitar ... mandolin organ piano saxophone synthesizer trombone trumpet ukulele violin voice
000002_0000 0.01542 0.008608 0.010215 0.035007 0.008873 0.00893 0.086853 0.671350 0.021807 0.135010 ... 0.006079 0.011073 0.084341 0.015115 0.781432 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0001 0.01542 0.008608 0.010215 0.076214 0.008873 0.00893 0.086853 0.630533 0.021807 0.244505 ... 0.006079 0.011073 0.084341 0.015115 0.781432 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0002 0.01542 0.008608 0.010215 0.076214 0.008873 0.00893 0.089177 0.858667 0.021807 0.244505 ... 0.006079 0.011073 0.084341 0.015115 0.188291 0.012166 0.025021 0.044818 0.067646 0.999691
000002_0003 0.01542 0.008608 0.010215 0.076214 0.004974 0.00893 0.089177 0.858667 0.012667 0.244505 ... 0.003388 0.009051 0.040380 0.009120 0.131694 0.005950 0.014247 0.044818 0.067646 0.999691
000002_0004 0.01542 0.008608 0.009334 0.076214 0.004974 0.00893 0.089177 0.858667 0.012667 0.244505 ... 0.003388 0.017866 0.078745 0.009120 0.204007 0.005950 0.014247 0.028634 0.088025 0.999691

5 rows × 23 columns


In [54]:
# Per-class tallies after thresholding at 0.5 ('top' is the majority value, 'freq' its count)
(df >= 0.5).describe().T.sort_values('freq')


Out[54]:
count unique top freq
guitar 29620525 2 True 16736215
drums 29620525 2 True 19005563
voice 29620525 2 True 20766522
synthesizer 29620525 2 False 21567964
violin 29620525 2 False 27651523
piano 29620525 2 False 28699240
mallet_percussion 29620525 2 False 29007329
flute 29620525 2 False 29105788
bass 29620525 2 False 29150189
cello 29620525 2 False 29165508
saxophone 29620525 2 False 29211328
organ 29620525 2 False 29290712
accordion 29620525 2 False 29428347
harmonica 29620525 2 False 29451312
bagpipes 29620525 2 False 29452449
trumpet 29620525 2 False 29461391
trombone 29620525 2 False 29468003
cymbals 29620525 2 False 29489953
ukulele 29620525 2 False 29493362
harp 29620525 2 False 29556563
banjo 29620525 2 False 29563360
mandolin 29620525 2 False 29587198
clarinet 29620525 2 False 29598953

In [71]:
df.median()


Out[71]:
accordion            0.007389
bagpipes             0.006224
banjo                0.004509
bass                 0.155189
cello                0.016092
clarinet             0.006047
cymbals              0.078683
drums                0.647003
flute                0.022228
guitar               0.569506
harmonica            0.008434
harp                 0.005587
mallet_percussion    0.044757
mandolin             0.004615
organ                0.033265
piano                0.119241
saxophone            0.019062
synthesizer          0.292279
trombone             0.012580
trumpet              0.015803
ukulele              0.011841
violin               0.086547
voice                0.819313
dtype: float64

Binary thresholding


In [55]:
# Target subsample size: roughly 100 clips per class, 23 classes
N_OUT = 23 * 100

In [56]:
# One binary (below/above 0.5) mapper per class column
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=2,
                                                 boundaries=[0.0, 0.5, 1.0]) for col in df}

In [ ]:
# Select N_OUT clips against the binarized class targets
idx, score = entrofy.entrofy(df, N_OUT, mappers=mappers,
                             seed=20180205,
                             quantile=0.05,
                             n_trials=10)

In [64]:
df.loc[idx].head(10)


Out[64]:
accordion bagpipes banjo bass cello clarinet cymbals drums flute guitar ... mandolin organ piano saxophone synthesizer trombone trumpet ukulele violin voice
000046_0053 0.062583 0.915700 0.018637 0.162992 0.040727 0.018444 0.100536 0.534406 0.033820 0.976719 ... 0.010782 0.714110 0.300557 0.087529 0.385306 0.079978 0.075567 0.030833 0.160921 0.794626
000311_0001 0.006875 0.001383 0.002581 0.696345 0.006392 0.003727 0.033844 0.651003 0.023251 0.955172 ... 0.003475 0.034006 0.156438 0.015048 0.539802 0.004837 0.006913 0.021388 0.040315 0.907296
000341_0068 0.024864 0.001311 0.028881 0.163791 0.020460 0.008838 0.041862 0.584283 0.018548 0.609002 ... 0.026970 0.123292 0.663510 0.030823 0.886328 0.012027 0.011733 0.251555 0.057407 0.970078
000368_0158 0.010230 0.005692 0.028237 0.502167 0.014982 0.004697 0.082778 0.646268 0.029445 0.822505 ... 0.018546 0.076291 0.458668 0.025077 0.938266 0.006261 0.009553 0.034601 0.047432 0.670003
000402_0003 0.027663 0.014083 0.009137 0.107472 0.176484 0.018028 0.028532 0.300647 0.058136 0.271525 ... 0.004556 0.554254 0.270120 0.031469 0.611561 0.038181 0.039272 0.010984 0.643716 0.946539
000644_0672 0.012434 0.014400 0.018587 0.270604 0.256813 0.067379 0.045278 0.427513 0.132918 0.987838 ... 0.018609 0.162592 0.245520 0.445257 0.072400 0.703359 0.640719 0.121939 0.180639 0.834285
001023_0083 0.013796 0.005061 0.003343 0.416631 0.460000 0.835826 0.005285 0.134761 0.307062 0.477020 ... 0.004188 0.093485 0.137888 0.544438 0.066581 0.091138 0.089995 0.009233 0.199989 0.549448
001033_0024 0.023249 0.012236 0.011249 0.184627 0.028087 0.071508 0.116394 0.717136 0.437962 0.626129 ... 0.010670 0.071872 0.229724 0.039138 0.511643 0.013548 0.028309 0.046583 0.179872 0.632515
001145_0137 0.003957 0.000918 0.000558 0.203969 0.146485 0.006395 0.010076 0.124820 0.023060 0.859799 ... 0.000823 0.567757 0.338795 0.018945 0.538661 0.009014 0.007018 0.002454 0.139204 0.894864
001214_1090 0.001480 0.000647 0.000639 0.587898 0.009738 0.002642 0.139987 0.929737 0.004773 0.968694 ... 0.000823 0.011013 0.118221 0.004195 0.743730 0.003819 0.004699 0.002008 0.013668 0.075813

10 rows × 23 columns


In [65]:
# Thresholded class counts within the selected subsample
(df.loc[idx] >= 0.5).describe().T.sort_values('freq')


Out[65]:
count unique top freq
guitar 2300 2 True 1153
voice 2300 2 False 1177
drums 2300 2 False 1325
synthesizer 2300 2 False 1499
violin 2300 2 False 1526
piano 2300 2 False 2008
cello 2300 2 False 2011
mallet_percussion 2300 2 False 2018
flute 2300 2 False 2023
saxophone 2300 2 False 2032
bass 2300 2 False 2036
trumpet 2300 2 False 2069
accordion 2300 2 False 2077
organ 2300 2 False 2083
harmonica 2300 2 False 2091
trombone 2300 2 False 2093
bagpipes 2300 2 False 2111
ukulele 2300 2 False 2132
cymbals 2300 2 False 2137
banjo 2300 2 False 2208
harp 2300 2 False 2211
mandolin 2300 2 False 2239
clarinet 2300 2 False 2271


In [69]:
!pwd


/home/bmcfee/git/cosmir/dev-set-builder/notebooks

In [68]:
idx.to_series().to_json('subsample_idx.json')

Multi-valued thresholds


In [ ]:
# Quartile bins per class (boundaries at 0, 0.25, 0.5, 0.75, 1.0)
mappers = {col: entrofy.mappers.ContinuousMapper(df[col],
                                                 prefix=col,
                                                 n_out=4,
                                                 boundaries=[0.0, 0.25, 0.5, 0.75, 1.0]) for col in df}

In [3]:
# Quartile-balanced run: select 1000 clips against the four-bin targets
idx, score = entrofy.entrofy(df, 1000, mappers=mappers, n_trials=100)
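
Once the quartile run finishes, a quick balance check (a sketch; the pd.cut bins mirror
the mapper boundaries above):

    # Count how many selected clips land in each likelihood quartile bin, per class.
    bins = [0.0, 0.25, 0.5, 0.75, 1.0]
    subsample = df.loc[idx]
    subsample.apply(lambda col: pd.cut(col, bins=bins, include_lowest=True).value_counts())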